Introducción

Youtube es una plataforma web dedicada a compartir vídeos de diferentes temas que van desde entretención hasta educación. Comenzó como una simple aplicación web lanzada en 2005 por Chud Hurley, Steven Chen y Jamed Karim para convertirse en una de las redes sociales más utilizadas en el mundo.

Hoy en día la plataforma tiene mas de 1.9 billones de usuarios activos cada mes y se estima que se suben 400 horas de vídeo por cada minuto del día y a su ves los usuarios ven alrededor de 1 billón de horas por día.

La plataforma recomienda vídeos automáticamente, de forma personalizada a cada usuario, y considerando que mas del 70% del tiempo gastado en YouTube es usado viendo lo que la plataforma recomienda, el algoritmo de recomendación es un tema de fundamental importancia para los creadores de contenido de la plataforma que quieren ganar popularidad, visitas, dar a conocer su trabajo e intereses, y muchas veces hacer una carrera en YouTube.

El algoritmo de YouTube busca aumentar el tiempo de vistas se vuelve es sumamente complejo, cambia constantemente y se puede considerar imposible de comprender en su totalidad, especialmente para quienes no trabajan en Google.

Utilizando los conceptos y algoritmos de minería de datos es posible estudiar los datos generados por los usuarios, y así encontrar patrones que determinen los contenidos más vistos y comentados dentro de la plataforma, donde dicha información es valiosa para los autores de contenido.

Descripción de los datos

El dataset utilizado tiene el nombre “Trending YouTube Video Statistics”, podemos encontrarlo en

https://www.kaggle.com/datasnaek/youtube-new

Este incluye diversas publicaciones de datos de los vídeos de Youtube separados por país junto a estadísticas focalizadas en el contenidos de los mismos. El dataset trae 200 vídeos de tendencias listados por día, donde los datos de cada región se encuentran en archivos separados, y en cada archivo se agregan los títulos de vídeos, el canal del vídeo, la fecha de publicación, tags, número de visitas, cantidad de likes y dislikes.

Podemos obtener los siguientes datos para cada registro:

Hipótesis iniciales

Como hipótesis iniciales, creemos que:

Exploración de los datos

La cantidad de datos obtenida por cada región se encuentra en distintos idiomas, entre ellos ruso, inglés, español, francés,japonés, entre otros. Para simplificar la información mostrada en el hito 1 se decidió utilizar la información proveniente de Estados Unidos. La idea del análisis a seguir es generar un predictor de contenido que seleccione los tópicos más interesante dentro de los datos a estudiar.

Cargar datos

library(gdata)
        library(purrr)
        library(ggplot2)
        library(jsonlite)
        data_dir <- ""
        us_videos <- read.csv(paste(data_dir, "USvideos.csv", sep=""))
        us_videos$trending_date <- as.Date(us_videos$trending_date, format = "%y.%d.%m")
        us_videos$title <- as.character(us_videos$title)
        
        us_category <- fromJSON("US_category_id.json")
        us_category <- data.frame(id=us_category$items$id, name=us_category$items$snippet$title)
        us_category$name <- as.character(us_category$name)
        
        category_name <- function(id) {
          return(us_category[us_category$id == id, "name"])
        }
        
        us_videos$category <- map_chr(us_videos$category_id, category_name)

Estadísticas

Cantidad de datos nulos

print(sum(is.na(us_videos)))
## [1] 0

Número de canales distintos

print(length(unique(us_videos$channel_title)))
## [1] 2207

Número de videos distintos (hay videos que son trending más de 1 día)

print(length(unique(us_videos$video_id)))
## [1] 6351

Fechas de publicación

print(min(us_videos$trending_date))
## [1] "2017-11-14"
print(max(us_videos$trending_date))
## [1] "2018-06-14"
print(max(us_videos$trending_date) - min(us_videos$trending_date))
## Time difference of 212 days

Cantidad de categorías distintas

print(length(unique(us_videos$category)))
## [1] 16
ggplot(data.frame(Category=us_videos$category)) +
          geom_bar(aes(x=reorder(Category)), stat="count") +
          coord_flip() +
          ggtitle("Histograma de categorías") +
          xlab("Categorías") +
          ylab("Número de videos")

t <- unique(us_videos[,c("video_id", "trending_date")])
        t <- as.data.frame(table(t["video_id"]))
        t <- as.data.frame(table(t[2]))
        t$Var1 <- as.numeric(t$Var1)
        
        ggplot(data.frame(t)) +
          geom_bar(aes(x = Var1, y = Freq), stat="identity") +
          ggtitle("Frecuencia de videos por la cantidad de días trending") +
          xlab("Cantidad de días") +
          ylab("Frecuencia de videos")

Top 10 canales con mayor cantidad de días trending

t <- unique(us_videos[,c("trending_date", "channel_title")])
        t <- as.data.frame(table(t["channel_title"]))
        t <- t[order(t$Freq, decreasing=TRUE),]
        t[1:10,]
##                                        Var1 Freq
        ## 635                                    ESPN  202
        ## 1936 The Tonight Show Starring Jimmy Fallon  197
        ## 1387                                Netflix  193
        ## 1955                           TheEllenShow  192
        ## 2118                                    Vox  192
        ## 1904     The Late Show with Stephen Colbert  187
        ## 975                       Jimmy Kimmel Live  185
        ## 1102            Late Night with Seth Meyers  183
        ## 1687                         Screen Junkies  182
        ## 1373                                    NBA  181

Frecuencia de cantidad de días trending por canal

t <- data.frame(table(t[2]))
        t$Var1 <- as.numeric(t$Var1)
        ggplot(t) +
          geom_bar(aes(x = Var1, y = Freq), stat="identity") +
          ggtitle("Frecuencia de canales por la cantidad de días trending") +
          xlab("Cantidad de días") +
          ylab("Frecuencia de canales")

Porcentaje de vídeos con una palabra (de largo al menos 4) en mayúscula en el título

has_upper <- function(line) {
          v <- strsplit(line, " ")
          for (s in v) {
            if (length(s) > 3 && s == toupper(s)) {
              return(1)
            }
          }
          return(0)
        }
        
        t <- map(unique(us_videos$title), has_upper)
        print(Reduce("+", t))
## [1] 1318
print(Reduce("+", t) / length(t))
## [1] 0.2041828

Porcentaje de videos con ratings/comentarios deshabilitados

t <- us_videos[us_videos$comments_disabled == "True" | us_videos$ratings_disabled == "True",]
        t <- unique(t$video_id)
        print(length(t))
## [1] 122
print(length(t) / length(unique(us_videos$video_id)))
## [1] 0.01920957

Otras visualizaciones exploratiroas fueron realizadas con otras herramientas, y son presantadas como imagenes a continuacion

Nube de Palabras acerca de los tags

WordCloud de los tags

WordCloud de los tags

Likes vs Dislikes

Likes vs Dislikes

Hito 2

Resumen

  • Inicializar dataset
  • Preprocesado
    • Funciones Utiles
      • Funciones para limpiar urls
      • Funciones de limpieza para cada columna
      • Funciones de conteo
      • Limpieza y Extraccion de Datos
        • LabelEncoder
        • BoW de los Titulos
  • Clasificación
    • Creación de las categorías/clases
    • Features / Conteo
      • Comparación clasificadores
      • Detalle de métricas para Decision Tree
      • Experimientos de distintas columnas
    • BoW de los Titulos
  • Clustering

Inicializar dataset

se crea un nuevo dataframe con las columnas que nos sirven y que no necesitan preprocesado.

Mas abajo se le agregan mas features

In [1]:
import pandas as pd
import numpy as np
import io

path_to_file = 'USvideos.xlsx'

dataset = pd.read_excel(path_to_file)
print(dataset.columns)
print(dataset.shape) #(40949, 16)

clean_columns = [
  'category_id',
  'comment_count',
  #'comments_disabled',
  #'ratings_disabled',
  #'views',
]



dataset = dataset[dataset['ratings_disabled']==False]
dataset = dataset[dataset['likes'].notnull()]
dataset = dataset[dataset['dislikes'].notnull()]


clean_df = dataset[clean_columns]

print(clean_df.shape) #(40949, 16)
clean_df.head()
Index(['video_id', 'trending_date', 'title', 'channel_title', 'category_id',
       'publish_time', 'tags', 'views', 'likes', 'dislikes', 'comment_count',
       'thumbnail_link', 'comments_disabled', 'ratings_disabled',
       'video_error_or_removed', 'description'],
      dtype='object')
(40949, 16)
(40780, 2)
Out[1]:
category_id comment_count
0 22 15954
1 24 12703
2 23 8181
3 24 2146
4 24 17518

Preprocesado

se limpia y agregan features a clean_df. es necesario correr toda la seccion, lo de Bag of Words no

Funciones Utiles

Funciones para limpiar urls

In [13]:
import re
from urllib.parse import urlparse

url_re = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/"

def delete_urls(text):
    return re.sub(url_re, "", text)

# genera lista con las urls en un texto
def urls_in_text(text):
    return [url_array[0] for url_array in re.findall(url_re, text)]

# ['http://www.cwi.nl:80/%7Eguido/Python.html',] >> ['www.cwi.nl:80',]
# solo funciona cuando tienen http:// , https://
def urls2netloc(urls):
    res = []
    for u in urls:
        res.append(urlparse(u).netloc) #print(u,urlparse(u))
    return res

# remplaza las urls en un texto, dejando los netloc
def replace_urls(text):
    urls = urls_in_text(text)
    netlocs = urls2netloc(urls)
    # print(urls) # print(netlocs)
    for u, n in zip(urls, netlocs):
        text = text.replace(u, n)
    return text

# example = '''
# blabllablbl http://www.dasd.com/asd \n 
# blab llablblblabllablbl blabllablbl https://xdasd.net/asdasdasds
# https://www.youtub.com/123123
# '''
# replace_urls(example)

Funciones de Limpieza para cada columna

Retornan el string limpio

In [14]:
def clean_description(de):
  clean_str = de.replace('\n', ' \n ')
  clean_str = replace_urls(clean_str)
  return clean_str

def clean_title(ti):
  # nada que limpiar ?
  return ti

# se cambiara el formato para que funcione con el BoW
# string uno"|"str2"|"... >> string_uno str2 ...
def clean_tags(ta):
  clean_str = ta.replace(' ', '_')
  clean_str = clean_str.replace('"|"', ' ')
  return clean_str

Funciones de conteo

In [15]:
def urls_count(text):
  return len(urls_in_text(text))

def get_ratio(a, b):
  if b == 0 :
    return -1
  return a/b


def uppercase_count(text):
  return sum(1 for c in text if c.isupper())

def spaces_count(text):
  return sum(1 for c in text if c.isspace())

def numbers_count(text):
  return sum(1 for c in text if c.isdigit())

def words_count(text):
  return sum(1 for c in text if c.isalpha())

Limpieza y Extraccion de features

Limpieza y conteo de strings

In [16]:
cleaned = {
  'desc': [],
  'title': [],
  'tags': [],
}

new = {
  'desc': {
      'url_cnt': [],
      'question_cnt': [],
      'exclamation_cnt': [],
      'spaces_cnt': [],
      'numbers_cnt': [],
      'words_cnt': [],
      'uppercase_ratio': [],
      'len': [],
  },
  'title':  {
      'question_cnt': [],
      'exclamation_cnt': [],
      'spaces_cnt': [],
      'numbers_cnt': [],
      'words_cnt': [],
      #'uppercase_cnt': [],
      'uppercase_ratio': [],
      'len': [],
  },
  'tags':  {
      'cnt': [],
  },
}


iterator = zip(
    dataset['description'],
    dataset['title'],
    dataset['tags'])

for de, ti, ta in iterator:
  
  if type(de) is float:
    de = "" 
  if type(ti) is float:
    ti = "" 
  if type(ta) is float:
    ta = ""  

  cleaned_de = clean_description(de)
  cleaned_ti = clean_title(de)
  cleaned_ta = clean_tags(ta)

  cleaned['desc'].append(cleaned_de)
  cleaned['title'].append(cleaned_ti)
  cleaned['tags'].append(cleaned_ta)

  
  new['desc']['url_cnt'].append(urls_count(de))
  new['desc']['question_cnt'].append(cleaned_de.count('?'))
  new['desc']['exclamation_cnt'].append(cleaned_de.count('!'))
  new['desc']['spaces_cnt'].append(spaces_count(cleaned_de))
  new['desc']['numbers_cnt'].append(numbers_count(cleaned_de))
  new['desc']['words_cnt'].append(words_count(cleaned_de))
  new['desc']['uppercase_ratio'].append(get_ratio(uppercase_count(cleaned_de), 
                                                  len(cleaned_de)))  
  new['desc']['len'].append(len(de))

  new['title']['question_cnt'].append(cleaned_ti.count('?'))
  new['title']['exclamation_cnt'].append(cleaned_ti.count('!'))
  new['title']['spaces_cnt'].append(spaces_count(cleaned_ti))
  new['title']['numbers_cnt'].append(numbers_count(cleaned_ti))
  new['title']['words_cnt'].append(words_count(cleaned_ti))
  #new['title']['uppercase_cnt'].append(uppercase_count(cleaned_ti))
  new['title']['uppercase_ratio'].append(get_ratio(uppercase_count(cleaned_ti), 
                                                  len(cleaned_ti)))
  new['title']['len'].append(len(cleaned_ti))

  new['tags']['cnt'].append(cleaned_ta.count(' '))

           
''' 
clean_df['description'] = cleaned['desc']
clean_df['title'] = cleaned['title']
clean_df['tags'] = cleaned['tags']
'''

clean_df['desc_url_cnt'] = new['desc']['url_cnt']
clean_df['desc_question_cnt'] = new['desc']['question_cnt']
clean_df['desc_exclamation_cnt'] = new['desc']['exclamation_cnt']
clean_df['desc_spaces_cnt'] = new['desc']['spaces_cnt']
clean_df['desc_numbers_cnt'] = new['desc']['numbers_cnt']
clean_df['desc_words_cnt'] = new['desc']['words_cnt']
clean_df['desc_uppercase_ratio'] = new['desc']['uppercase_ratio']
clean_df['desc_len'] = new['desc']['len']

clean_df['title_question_cnt'] = new['title']['question_cnt']
clean_df['title_exclamation_cnt'] = new['title']['exclamation_cnt']
clean_df['title_spaces_cnt'] = new['title']['spaces_cnt']
clean_df['title_numbers_cnt'] = new['title']['numbers_cnt']
clean_df['title_words_cnt'] = new['title']['words_cnt']
#clean_df['title_uppercase_cnt'] = new['title']['uppercase_cnt']
clean_df['title_uppercase_ratio'] = new['title']['uppercase_ratio']
clean_df['title_len'] = new['title']['len']

clean_df['tags_cnt'] = new['tags']['cnt']



print(clean_df.shape) # (40780, 18)
clean_df.head()
(40780, 18)
Out[16]:
category_id comment_count desc_url_cnt desc_question_cnt desc_exclamation_cnt desc_spaces_cnt desc_numbers_cnt desc_words_cnt desc_uppercase_ratio desc_len title_question_cnt title_exclamation_cnt title_spaces_cnt title_numbers_cnt title_words_cnt title_uppercase_ratio title_len tags_cnt
0 22 15954 0 0 0 135 23 1072 0.175887 1410 0 0 135 23 1072 0.175887 1410 0
1 24 12703 0 0 0 79 0 510 0.053968 630 0 0 79 0 510 0.053968 630 3
2 23 8181 0 1 2 73 4 860 0.116398 1177 1 2 73 4 860 0.116398 1177 21
3 24 2146 0 2 0 122 15 1096 0.066287 1403 2 0 122 15 1096 0.066287 1403 26
4 24 17518 0 0 3 52 11 490 0.037736 636 0 3 52 11 490 0.037736 636 13

LabelEncoder

Este transorma strings en un identificador unico, util para el nombre del canal, pero no se utilizo para el analisis ya no nos interesa este valor para relizar las predicciones

In [17]:
"""
from sklearn.preprocessing import LabelEncoder
y2 = dataset['channel_title']
lb = LabelEncoder()
clean_df = clean_df.assign(channel_title = lb.fit_transform(y2))
 
"""
Out[17]:
"\nfrom sklearn.preprocessing import LabelEncoder\ny2 = dataset['channel_title']\nlb = LabelEncoder()\nclean_df = clean_df.assign(channel_title = lb.fit_transform(y2))\n \n"

BoW de los Titulos

  • 381751 bigramas sin quitar stopwords # min_df=1
  • 352254 bigramas quitando stopwords en ingles #stop_words ='english',
  • 256310 bigramas cuando solo se usan al tener mas de 5 aparicionos # min_df=5
  • 80 bigramas cuando aparecen en el 10% de los titulos # min_df=0.1
In [19]:
## 2 grama
from sklearn.feature_extraction.text import CountVectorizer

bigram_vectorizer = CountVectorizer(ngram_range=(1, 2),
                                    token_pattern=r'\b\w+\b',
                                    stop_words ='english',
                                    min_df=0.1)

Xtitle_2 = bigram_vectorizer.fit_transform(cleaned['title'])
print(Xtitle_2.shape) # == (40780, 381751) == videos vs bigramas

'''
# total de apariciones por bigrama
title_bigram_total = Xtitle_2.sum(axis=0) 
bigram_total.shape == (1, 381751)

# total de bigramas por video
title_video_total = Xtitle_2.sum(axis=1) 
bigram_total.shape == (40780, 1)
'''

title_feature_names= bigram_vectorizer.get_feature_names()
# print(len(title_feature_names)) #381751

TitleBOW = pd.DataFrame(Xtitle_2.toarray()).fillna(0)
TitleBOW.head()
(40780, 80)
Out[19]:
0 1 2 3 4 5 6 7 8 9 ... 70 71 72 73 74 75 76 77 78 79
0 0 0 0 0 0 1 16 1 0 0 ... 0 0 3 1 0 0 1 0 1 1
1 0 0 0 0 0 1 4 0 1 0 ... 0 0 2 0 0 0 1 0 2 1
2 1 0 0 0 0 1 19 0 1 0 ... 0 0 1 0 0 0 1 0 15 15
3 1 1 0 2 2 0 13 0 0 0 ... 0 0 4 0 0 0 0 2 3 3
4 0 0 0 0 0 1 7 1 0 0 ... 0 0 6 1 1 1 1 0 1 1

5 rows × 80 columns

Clasificacion

Creacion de las categorias/clases

Estas categorias se reparten arbitrariamente, podrian separarse de mejor manera.

In [20]:
def clean_nans(df):
  res = df
  for c in df.columns:
    cnt = len(res[c])
    res = res[res[c].notnull()]
    if(cnt != len(res)):
      print(c,'=',cnt - len(res[c]), 'nulls')
  return res


target_columns = [
   'likes',
   'dislikes',
   'views',
]

target_df = dataset[target_columns]

#target_df['likes_ratio'] = (target_df['likes']+0)/(target_df['dislikes']+0)
likes_ratio = []
for l,d in zip(target_df['likes'], target_df['dislikes']):
  likes_ratio.append(get_ratio(l,d))
target_df = target_df.assign(likes_ratio = likes_ratio)


bins = [-np.inf, 0.5, 1.5, 15, 50.0, 100.0, 200.0, np.inf]
labels=['<0.5','<1.5','<15','<50','<100','<200','>200']
target_df['likes_ratio_category'] = pd.cut(target_df['likes_ratio'], bins=bins, labels=labels)

bins = [-np.inf, 500000, 1000000, 10000000, 50000000, np.inf]
labels=['0.5M','1M','10M','50M','>50M']
target_df['views_category'] = pd.cut(target_df['views'], bins=bins, labels=labels)


target_df.head()

X = clean_nans(clean_df)
Y = clean_nans(target_df)

Features / Conteo

Comparación clasificadores

In [21]:
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.svm import SVC  # support vector machine classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB  # naive bayes
from sklearn.neighbors import KNeighborsClassifier


import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, recall_score, precision_score


def run_classifier(clf, X_train, X_test, y_train, y_test, num_tests=100):
    metrics = {'f1-score': [], 'precision': [], 'recall': []}
    for _ in range(num_tests):       
        clf.fit(X_train, y_train)    ## Entrenamos con X_train y clases y_train
        predictions = clf.predict(X_test)   ## Predecimos con nuevos datos (los de test X_test)
        
        metrics['y_pred'] = predictions
        metrics['y_prob'] = clf.predict_proba(X_test)[:,1]
        metrics['f1-score'].append(f1_score(y_test, predictions, average='weighted')) 
        metrics['recall'].append(recall_score(y_test, predictions, average='weighted'))
        metrics['precision'].append(precision_score(y_test, predictions, average='weighted'))
    return metrics

  
classifiers = [
  ("Base Dummy", DummyClassifier(strategy='stratified')),
  ("Decision Tree", DecisionTreeClassifier()),
  ("Gaussian Naive Bayes", GaussianNB()),
  ("KNN", KNeighborsClassifier(n_neighbors=5)),
]

target_classes = [
  ('Likes/Dislikes ratio Category', 'likes_ratio_category'),
  ('Views Category', 'views_category'),
]


X = clean_df 

for tname, col in target_classes:
  print('Clasificando: {}'.format(tname))
  y = target_df[col]  
  
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.60)
  results = {}
  for name, clf in classifiers:
      metrics = run_classifier(clf, X_train, X_test, y_train, y_test)   # hay que implementarla en el bloque anterior.
      results[name] = metrics
      print("----------------")
      print("Resultados para clasificador: ",name) 
      print("Precision promedio:",np.array(metrics['precision']).mean())
      print("Recall promedio:",np.array(metrics['recall']).mean())
      print("F1-score promedio:",np.array(metrics['f1-score']).mean())
      print("----------------\n\n")
Clasificando: Likes/Dislikes ratio Category
----------------
Resultados para clasificador:  Base Dummy
Precision promedio: 0.29130448279397003
Recall promedio: 0.2903616969102501
F1-score promedio: 0.2908174023441908
----------------


----------------
Resultados para clasificador:  Decision Tree
Precision promedio: 0.863015241196066
Recall promedio: 0.8631441065881971
F1-score promedio: 0.8630219904848322
----------------


----------------
Resultados para clasificador:  Gaussian Naive Bayes
Precision promedio: 0.2970287395339853
Recall promedio: 0.11218734673859737
F1-score promedio: 0.14631408755502107
----------------


----------------
Resultados para clasificador:  KNN
Precision promedio: 0.531804166620094
Recall promedio: 0.5341262056563675
F1-score promedio: 0.5298838506214547
----------------


Clasificando: Views Category
----------------
Resultados para clasificador:  Base Dummy
Precision promedio: 0.3354724737355357
Recall promedio: 0.3362457086807259
F1-score promedio: 0.33578344603596766
----------------


----------------
Resultados para clasificador:  Decision Tree
Precision promedio: 0.8541902543879519
Recall promedio: 0.8537551087134214
F1-score promedio: 0.8538358071565976
----------------


----------------
Resultados para clasificador:  Gaussian Naive Bayes
Precision promedio: 0.5275998126668696
Recall promedio: 0.3995831289847964
F1-score promedio: 0.42190877201508337
----------------


----------------
Resultados para clasificador:  KNN
Precision promedio: 0.7196978924331467
Recall promedio: 0.7307503678273665
F1-score promedio: 0.7214834697682656
----------------


Detalle de metricas para Decision Tree

In [22]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report


target_classes = [
    ('Likes/Dislikes ratio Category', 'likes_ratio_category'),
    ('Views Category', 'views_category'),
]

X = clean_df 

for name, col in target_classes:
  print('Clasificando: {}'.format(name))
  y = target_df[col]  

  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33, random_state=37, stratify=y)
  clf = DecisionTreeClassifier()
  clf.fit(X_train, y_train)   

  y_pred = clf.predict(X_test)
  print("Accuracy en test set:", accuracy_score(y_test, y_pred))
  print(classification_report(y_test, y_pred))
Clasificando: Likes/Dislikes ratio Category
Accuracy en test set: 0.9073413582998959
              precision    recall  f1-score   support

        <0.5       0.82      0.80      0.81       148
        <1.5       0.84      0.87      0.86       234
        <100       0.89      0.87      0.88      2673
         <15       0.92      0.92      0.92      3418
        <200       0.88      0.83      0.85      1063
         <50       0.91      0.94      0.93      5709
        >200       0.90      0.78      0.84       213

   micro avg       0.91      0.91      0.91     13458
   macro avg       0.88      0.86      0.87     13458
weighted avg       0.91      0.91      0.91     13458

Clasificando: Views Category
Accuracy en test set: 0.886312973695943
              precision    recall  f1-score   support

        0.5M       0.93      0.91      0.92      5636
         10M       0.90      0.92      0.91      4775
          1M       0.76      0.75      0.75      2444
         50M       0.89      0.91      0.90       546
        >50M       0.91      0.91      0.91        57

   micro avg       0.89      0.89      0.89     13458
   macro avg       0.88      0.88      0.88     13458
weighted avg       0.89      0.89      0.89     13458

Experimientos de distintas columnas

In [24]:
import warnings

warnings.filterwarnings('ignore')

clean_df.columns == ['category_id', 'comment_count', 'desc_url_cnt', 'desc_question_cnt',
       'desc_exclamation_cnt', 'desc_spaces_cnt', 'desc_numbers_cnt',
       'desc_words_cnt', 'desc_uppercase_ratio', 'desc_len',
       'title_question_cnt', 'title_exclamation_cnt', 'title_spaces_cnt',
       'title_numbers_cnt', 'title_words_cnt', 'title_uppercase_ratio',
       'title_len', 'tags_cnt'
       ]



from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.svm import SVC  # support vector machine classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB  # naive bayes
from sklearn.neighbors import KNeighborsClassifier


cols_groups =[
  ( "original",
    [
      'category_id', 'comment_count', 'desc_url_cnt', 'desc_question_cnt',
      'desc_exclamation_cnt', 'desc_spaces_cnt', 'desc_numbers_cnt',
      'desc_words_cnt', 'desc_uppercase_ratio', 'desc_len',
      'title_question_cnt', 'title_exclamation_cnt', 'title_spaces_cnt',
      'title_numbers_cnt', 'title_words_cnt', 'title_uppercase_ratio',
      'title_len', 'tags_cnt'
    ]
  ),

 ( "sin tags",
  [
    'category_id', 'comment_count', 'desc_url_cnt', 'desc_question_cnt',
    'desc_exclamation_cnt', 'desc_spaces_cnt', 'desc_numbers_cnt',
    'desc_words_cnt', 'desc_uppercase_ratio', 'desc_len',
    'title_question_cnt', 'title_exclamation_cnt', 'title_spaces_cnt',
    'title_numbers_cnt', 'title_words_cnt', 'title_uppercase_ratio',
    'title_len'
  ]),

  ( "sin descripcion",
    [
      'category_id', 'comment_count',
      'title_question_cnt', 'title_exclamation_cnt', 'title_spaces_cnt',
      'title_numbers_cnt', 'title_words_cnt', 'title_uppercase_ratio',
      'title_len', 'tags_cnt'
    ]
  ),
    
   ( "sin titulo",
    [
      'category_id', 'comment_count', 'desc_url_cnt', 'desc_question_cnt',
      'desc_exclamation_cnt', 'desc_spaces_cnt', 'desc_numbers_cnt',
      'desc_words_cnt', 'desc_uppercase_ratio', 'desc_len',
      'tags_cnt',
    ]
  )
    
]

classifiers = [
  ("Base Dummy", DummyClassifier(strategy='stratified')),
  ("Decision Tree", DecisionTreeClassifier()),
  ("Gaussian Naive Bayes", GaussianNB()),
  ("KNN", KNeighborsClassifier(n_neighbors=5)),
]

target_classes = [
    ('Likes/Dislikes ratio Category', 'likes_ratio_category'),
    ('Views Category', 'views_category'),
]


print('clasificador, target, cols_description, precision, recall, f1-score')
for tname, col in target_classes:
  for cname, clf in classifiers:
    for gdescription, group in cols_groups:
      
      X = clean_df[group]
      y = target_df[col] 
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33, random_state=37, stratify=y)

      metrics = run_classifier(clf, X_train, X_test, y_train, y_test)   # hay que implementarla en el bloque anterior.
      print("{}, {}, {}, {}, {}, {}, ".format(
          cname,
          tname,
          gdescription,
          np.array(metrics['precision']).mean(),
          np.array(metrics['recall']).mean(),
          np.array(metrics['f1-score']).mean()
        )
      )
clasificador, target, cols_description, precision, recall, f1-score
Base Dummy, Likes/Dislikes ratio Category, original, 0.2910650710135786, 0.2909080101055135, 0.2909724650419197, 
Base Dummy, Likes/Dislikes ratio Category, sin tags, 0.2910785817382675, 0.2910848565908753, 0.2910672285567452, 
Base Dummy, Likes/Dislikes ratio Category, sin descripcion, 0.2908763762890059, 0.29086417001040277, 0.290858336274859, 
Base Dummy, Likes/Dislikes ratio Category, sin titulo, 0.29057068152107474, 0.2904696091544063, 0.2905089145497421, 
Decision Tree, Likes/Dislikes ratio Category, original, 0.90653436393915, 0.906836825679893, 0.9064931853401336, 
Decision Tree, Likes/Dislikes ratio Category, sin tags, 0.898350886881078, 0.8983481943825236, 0.898164661935243, 
Decision Tree, Likes/Dislikes ratio Category, sin descripcion, 0.9049898695222146, 0.9052801307772329, 0.9049442790649455, 
Decision Tree, Likes/Dislikes ratio Category, sin titulo, 0.9057620046869519, 0.9060982315351463, 0.9057648510129401, 
Gaussian Naive Bayes, Likes/Dislikes ratio Category, original, 0.27743378904863136, 0.11301827909050381, 0.14733879710779557, 
Gaussian Naive Bayes, Likes/Dislikes ratio Category, sin tags, 0.2851420116093783, 0.11012037449843959, 0.14463974808475039, 
Gaussian Naive Bayes, Likes/Dislikes ratio Category, sin descripcion, 0.3048887471709926, 0.2010699955416853, 0.23111347639254956, 
Gaussian Naive Bayes, Likes/Dislikes ratio Category, sin titulo, 0.3044901302565929, 0.20099569029573489, 0.23094828113683694, 
KNN, Likes/Dislikes ratio Category, original, 0.5956521403825823, 0.5972655669490265, 0.5944218541746994, 
KNN, Likes/Dislikes ratio Category, sin tags, 0.5915374837261078, 0.5939218308812603, 0.5906920619208208, 
KNN, Likes/Dislikes ratio Category, sin descripcion, 0.5705805008706444, 0.5725962252935057, 0.5694585044110527, 
KNN, Likes/Dislikes ratio Category, sin titulo, 0.5712557402021254, 0.5729677515232574, 0.569906684667086, 
Base Dummy, Views Category, original, 0.3360392192239315, 0.33595928072521913, 0.3359863490806703, 
Base Dummy, Views Category, sin tags, 0.3352093518528806, 0.335197651954228, 0.3351925848700018, 
Base Dummy, Views Category, sin descripcion, 0.33540150008961256, 0.3354606925248924, 0.3354193416677285, 
Base Dummy, Views Category, sin titulo, 0.33567214725504635, 0.3357445385644226, 0.33569570778928015, 
Decision Tree, Views Category, original, 0.8859560489949448, 0.885832218754644, 0.8858425107434346, 
Decision Tree, Views Category, sin tags, 0.8848009487697993, 0.8847206122752267, 0.8846732024259351, 
Decision Tree, Views Category, sin descripcion, 0.8857063314470737, 0.8851396938623866, 0.8853547356392764, 
Decision Tree, Views Category, sin titulo, 0.8849403438780873, 0.8849435280130777, 0.8849035339419947, 
Gaussian Naive Bayes, Views Category, original, 0.5232417786029844, 0.4312676474959133, 0.4374419493260593, 
Gaussian Naive Bayes, Views Category, sin tags, 0.5240482649800747, 0.43141625798781386, 0.4377382046710456, 
Gaussian Naive Bayes, Views Category, sin descripcion, 0.5664710971371771, 0.5397533065834448, 0.5124119842861878, 
Gaussian Naive Bayes, Views Category, sin titulo, 0.5670941097000224, 0.5407192747807994, 0.5132740858647075, 
KNN, Views Category, original, 0.7524560839976111, 0.760439887056026, 0.7538406683811378, 
KNN, Views Category, sin tags, 0.7502230290536684, 0.7584336454153663, 0.7517379731739819, 
KNN, Views Category, sin descripcion, 0.7417391253943099, 0.7506315945905778, 0.7432118829568768, 
KNN, Views Category, sin titulo, 0.7428215691329337, 0.7517461732798334, 0.7443572192618576, 

BoW de los Titulos

In [25]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report


classifiers = [
  ("Base Dummy", DummyClassifier(strategy='stratified')),
  ("Decision Tree", DecisionTreeClassifier()),
  ("Gaussian Naive Bayes", GaussianNB()),
  ("KNN", KNeighborsClassifier(n_neighbors=5)),
]

target_classes = [
    ('Likes/Dislikes ratio Category', 'likes_ratio_category'),
    ('Views Category', 'views_category'),
]


X = TitleBOW

print('clasificador, target, precision, recall, f1-score')
for tname, col in target_classes:
  for cname, clf in classifiers:
    y = target_df[col] 
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33, random_state=37, stratify=y)

    metrics = run_classifier(clf, X_train, X_test, y_train, y_test)
    print("{}, {}, {}, {}, {}, ".format(
        cname,
        tname,
        np.array(metrics['precision']).mean(),
        np.array(metrics['recall']).mean(),
        np.array(metrics['f1-score']).mean()
      )
    )
clasificador, target, precision, recall, f1-score
Base Dummy, Likes/Dislikes ratio Category, 0.29129255367636114, 0.2914348342993015, 0.2913510019404728, 
Decision Tree, Likes/Dislikes ratio Category, 0.8513652018054291, 0.8493260514192303, 0.8479357800646984, 
Gaussian Naive Bayes, Likes/Dislikes ratio Category, 0.39115108693039446, 0.15165700698469317, 0.19517524922388343, 
KNN, Likes/Dislikes ratio Category, 0.7911272379381749, 0.7892703224847676, 0.7878215950501386, 
Base Dummy, Views Category, 0.33544406723356224, 0.3354673799970278, 0.33544430023435917, 
Decision Tree, Views Category, 0.8250283755672487, 0.8289901917075347, 0.8255508092478927, 
Gaussian Naive Bayes, Views Category, 0.43442495961481387, 0.26110863426957936, 0.30680706161124893, 
KNN, Views Category, 0.7874921431158621, 0.7921682270768315, 0.7883668086309785, 

Clustering

In [26]:
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as shc
import seaborn as sns

from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import normalize

columns = ['likes', 'dislikes', 'views', 'comment_count']
X = dataset[columns]
In [28]:
# Matriz de correlación

cor = X.corr()
sns.heatmap(cor, square=True, vmin=0)
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f46279b6320>
In [29]:
# Matriz de correlación

cor = clean_df.corr()
plt.figure(figsize=(15, 15))
sns.heatmap(cor, square=True, vmin=0)
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4627913b38>
In [30]:
# Matriz de correlación

cor = target_df.corr()
sns.heatmap(cor, square=True, vmin=0)
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f46278b5fd0>
In [31]:
# Matriz de correlación

clean_df.columns == ['category_id', 'comment_count', 'desc_url_cnt', 'desc_question_cnt',
       'desc_exclamation_cnt', 'desc_spaces_cnt', 'desc_numbers_cnt',
       'desc_words_cnt', 'desc_uppercase_ratio', 'desc_len',
       'title_question_cnt', 'title_exclamation_cnt', 'title_spaces_cnt',
       'title_numbers_cnt', 'title_words_cnt', 'title_uppercase_ratio',
       'title_len', 'tags_cnt']

target_df.columns == ['likes', 'dislikes', 'views', 'likes_ratio', 'likes_ratio_category',
       'views_category']

pd_total =pd.concat([clean_df, target_df], axis=1)
cor = pd_total.corr()
plt.figure(figsize=(15, 15))
sns.heatmap(cor, square=True, vmin=0)
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4627840a20>
In [32]:
# small

clean_df.columns == ['category_id', 'comment_count', 'desc_url_cnt', 'desc_question_cnt',
       'desc_exclamation_cnt', 'desc_spaces_cnt', 'desc_numbers_cnt',
       'desc_words_cnt', 'desc_uppercase_ratio', 'desc_len',
       'title_question_cnt', 'title_exclamation_cnt', 'title_spaces_cnt',
       'title_numbers_cnt', 'title_words_cnt', 'title_uppercase_ratio',
       'title_len', 'tags_cnt']

target_df.columns == ['likes', 'dislikes', 'views', 'likes_ratio', 'likes_ratio_category',
       'views_category']

cols = ['category_id', 'comment_count', 'desc_len',
       'title_question_cnt', 'title_exclamation_cnt',
       'title_numbers_cnt', 'title_words_cnt', 'title_uppercase_ratio',
       'tags_cnt',
       'likes_ratio','views'
       ]

pd_total =pd.concat([clean_df, target_df], axis=1)
small_df = pd_total[cols]

cor = small_df.corr()
plt.figure(figsize=(15, 15))
sns.heatmap(cor, square=True, vmin=0)
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4627772eb8>
In [33]:
import matplotlib.pyplot as plt
import seaborn as sns; sns.set(style="ticks", color_codes=True)

cols = [
  'category_id', 'comment_count', 'desc_len',
  'title_question_cnt', 'title_exclamation_cnt',
  'title_numbers_cnt', 'title_words_cnt', 'title_uppercase_ratio',
  'tags_cnt',
  'likes_ratio','views'
  ]

pd_total =pd.concat([clean_df, target_df], axis=1)
small_df = pd_total[cols]

                 

g = sns.pairplot(small_df) # Parametro kind="reg" agrega una recta
plt.show()
In [34]:
# Clustering con un sample de 3000

from mpl_toolkits.mplot3d import Axes3D

def cluster_data(data, sample_size, seed, cluster_size):
    plt.figure(figsize=(15, 10))
    data_sample = data.sample(sample_size, random_state=seed)
    normalized = data_sample / data_sample.max()
    
    dend = shc.dendrogram(shc.linkage(normalized, method='ward'))
    plt.show()

    cluster = AgglomerativeClustering(n_clusters=cluster_size, affinity='euclidean', linkage='ward')
    cluster.fit_predict(normalized)

    fig = plt.figure(figsize=(10, 7))
    ax = Axes3D(fig)
    ax.scatter(data_sample.values[:,0],
               data_sample.values[:,1],
               data_sample.values[:,2],
               c=cluster.labels_,
               cmap='rainbow')
    attr = list(data_sample)
    ax.set_xlabel(attr[0])
    ax.set_ylabel(attr[1])
    ax.set_zlabel(attr[2])
    plt.show()
    
    data_sample['cluster'] = cluster.labels_
    
    for i in range(cluster_size):
        c = data_sample[data_sample['cluster'] == i]
        print('cluster ' + str(i) + ':')
        print('cluster size: ' + str(len(c)))
        for j in attr:
            print('avg ' + j + ': ' + str(c[j].mean()))
        print()
    
cluster_data(X, 3000, 15, 4)
cluster 0:
cluster size: 5
avg likes: 3660809.6
avg dislikes: 386043.6
avg views: 133785922.4
avg comment_count: 555433.2

cluster 1:
cluster size: 2882
avg likes: 41113.37404580152
avg dislikes: 1664.6939625260236
avg views: 1325791.7192921583
avg comment_count: 4320.817487855656

cluster 2:
cluster size: 15
avg likes: 1766438.2666666666
avg dislikes: 69262.4
avg views: 60601880.53333333
avg comment_count: 176513.53333333333

cluster 3:
cluster size: 98
avg likes: 514065.9387755102
avg dislikes: 25095.836734693876
avg views: 15888275.5
avg comment_count: 49984.26530612245

In [35]:
# Clustering de videos unicos

Y = dataset.copy()
Y['trending_date'] = pd.to_datetime(Y['trending_date'], format='%y.%d.%m')
Y = Y.sort_values(by='trending_date')
Y = Y.drop_duplicates(subset=['video_id'], keep='last')
Y = Y[columns]

cluster_data(Y, 5000, 17, 4)
cluster 0:
cluster size: 11
avg likes: 2145333.1818181816
avg dislikes: 373789.8181818182
avg views: 98338802.54545455
avg comment_count: 325303.45454545453

cluster 1:
cluster size: 697
avg likes: 179142.56527977044
avg dislikes: 8327.718794835007
avg views: 5746786.824964132
avg comment_count: 18699.979913916784

cluster 2:
cluster size: 61
avg likes: 1000256.9344262296
avg dislikes: 44171.11475409836
avg views: 30650139.81967213
avg comment_count: 117992.98360655738

cluster 3:
cluster size: 4231
avg likes: 15536.699125502246
avg dislikes: 864.782320964311
avg views: 681987.7123611439
avg comment_count: 1900.5769321673363

Cambios en hito 3

  • Clasificacion
    • Creacion de las categorias/clases
    • Features / Conteo
      • Comparación clasificadores
      • Detalle de metricas para Decision Tree
      • Experimientos de distintas columnas
  • Matriz de correlación
  • Regresión

Inicializar dataset

Se hace para datasets de todos los paises

In [1]:
import pandas as pd
import numpy as np
import io


FILE = 'USvideos'
# FILE = 'JPvideos'
# FILE = 'MXvideos'

path_to_file = FILE+'.xlsx'


dataset = pd.read_excel(path_to_file)
print(dataset.columns)
print(dataset.shape) #(40949, 16)

clean_columns = [
  'category_id',
  #'comment_count',
  #'comments_disabled',
  #'ratings_disabled',
  #'views',
]



dataset = dataset[dataset['ratings_disabled']==False]
dataset = dataset[dataset['likes'].notnull()]
dataset = dataset[dataset['dislikes'].notnull()]


clean_df = dataset[clean_columns]

print(clean_df.shape) #(40949, 16)
clean_df.head()
# USA
Index(['video_id', 'trending_date', 'title', 'channel_title', 'category_id',
       'publish_time', 'tags', 'views', 'likes', 'dislikes', 'comment_count',
       'thumbnail_link', 'comments_disabled', 'ratings_disabled',
       'video_error_or_removed', 'description'],
      dtype='object')
(40949, 16)
(40780, 1)
Out[1]:
category_id
0 22
1 24
2 23
3 24
4 24
# JAPON
Index(['video_id', 'trending_date', 'title', 'channel_title', 'category_id',
       'publish_time', 'tags', 'views', 'likes', 'dislikes', 'comment_count',
       'thumbnail_link', 'comments_disabled', 'ratings_disabled',
       'video_error_or_removed', 'description'],
      dtype='object')
(21718, 16)
(19132, 1)
Out[1]:
category_id
0 25.0
1 1.0
2 28.0
3 25.0
4 1.0
# MEXICO
Index(['video_id', 'trending_date', 'title', 'channel_title', 'category_id',
       'publish_time', 'tags', 'views', 'likes', 'dislikes', 'comment_count',
       'thumbnail_link', 'comments_disabled', 'ratings_disabled',
       'video_error_or_removed', 'description'],
      dtype='object')
(44043, 16)
(39817, 1)
Out[0]:
category_id
0 24.0
1 22.0
2 25.0
3 25.0
4 26.0

Preprocesado

Se limpia y agregan features a clean_df.

Funciones Utiles

Funciones para limpiar urls

In [2]:
import re
from urllib.parse import urlparse

url_re = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/"

def delete_urls(text):
    return re.sub(url_re, "", text)

# genera lista con las urls en un texto
def urls_in_text(text):
    return [url_array[0] for url_array in re.findall(url_re, text)]

# ['http://www.cwi.nl:80/%7Eguido/Python.html',] >> ['www.cwi.nl:80',]
# solo funciona cuando tienen http:// , https://
def urls2netloc(urls):
    res = []
    for u in urls:
        res.append(urlparse(u).netloc) #print(u,urlparse(u))
    return res

# remplaza las urls en un texto, dejando los netloc
def replace_urls(text):
    urls = urls_in_text(text)
    netlocs = urls2netloc(urls)
    # print(urls) # print(netlocs)
    for u, n in zip(urls, netlocs):
        text = text.replace(u, n)
    return text

# example = '''
# blabllablbl http://www.dasd.com/asd \n 
# blab llablblblabllablbl blabllablbl https://xdasd.net/asdasdasds
# https://www.youtub.com/123123
# '''
# replace_urls(example)

Funciones de Limpieza para cada columna

Retornan el string limpio

In [3]:
def clean_description(de):
  clean_str = de.replace('\n', ' \n ')
  clean_str = replace_urls(clean_str)
  return clean_str

def clean_title(ti):
  return ti

# se cambiara el formato para que funcione con el BoW
# string uno"|"str2"|"... >> string_uno str2 ...
def clean_tags(ta):
  clean_str = ta.replace(' ', '_')
  clean_str = clean_str.replace('"|"', ' ')
  return clean_str

Funciones de conteo

In [4]:
def urls_count(text):
  return len(urls_in_text(text))

def get_ratio(a, b):
  if b == 0 :
    return -1
  return a/b


def uppercase_count(text):
  return sum(1 for c in text if c.isupper())

def spaces_count(text):
  return sum(1 for c in text if c.isspace())

def numbers_count(text):
  return sum(1 for c in text if c.isdigit())

def words_count(text):
  return sum(1 for c in text if c.isalpha())

Limpieza y Extraccion de Datos

In [5]:
cleaned = {
  'desc': [],
  'title': [],
  'tags': [],
}

new = {
  'desc': {
      'url_cnt': [],
      'question_cnt': [],
      'exclamation_cnt': [],
      'spaces_cnt': [],
      'numbers_cnt': [],
      'words_cnt': [],
      'uppercase_ratio': [],
      'len': [],
  },
  'title':  {
      'question_cnt': [],
      'exclamation_cnt': [],
      'spaces_cnt': [],
      'numbers_cnt': [],
      'words_cnt': [],
      #'uppercase_cnt': [],
      'uppercase_ratio': [],
      'len': [],
  },
  'tags':  {
      'cnt': [],
  },
}


iterator = zip(
    dataset['description'],
    dataset['title'],
    dataset['tags'])

for de, ti, ta in iterator:
  
  if type(de) is float:
    de = "" 
  if type(ti) is float:
    ti = "" 
  if type(ta) is float:
    ta = ""  

  cleaned_de = clean_description(de)
  cleaned_ti = clean_title(de)
  cleaned_ta = clean_tags(ta)

  cleaned['desc'].append(cleaned_de)
  cleaned['title'].append(cleaned_ti)
  cleaned['tags'].append(cleaned_ta)

  
  new['desc']['url_cnt'].append(urls_count(de))
  new['desc']['question_cnt'].append(cleaned_de.count('?'))
  new['desc']['exclamation_cnt'].append(cleaned_de.count('!'))
  new['desc']['spaces_cnt'].append(spaces_count(cleaned_de))
  new['desc']['numbers_cnt'].append(numbers_count(cleaned_de))
  new['desc']['words_cnt'].append(words_count(cleaned_de))
  new['desc']['uppercase_ratio'].append(get_ratio(uppercase_count(cleaned_de), 
                                                  len(cleaned_de)))  
  new['desc']['len'].append(len(de))

  new['title']['question_cnt'].append(cleaned_ti.count('?'))
  new['title']['exclamation_cnt'].append(cleaned_ti.count('!'))
  new['title']['spaces_cnt'].append(spaces_count(cleaned_ti))
  new['title']['numbers_cnt'].append(numbers_count(cleaned_ti))
  new['title']['words_cnt'].append(words_count(cleaned_ti))
  #new['title']['uppercase_cnt'].append(uppercase_count(cleaned_ti))
  new['title']['uppercase_ratio'].append(get_ratio(uppercase_count(cleaned_ti), 
                                                  len(cleaned_ti)))
  new['title']['len'].append(len(cleaned_ti))

  new['tags']['cnt'].append(cleaned_ta.count(' '))

           
''' 
clean_df['description'] = cleaned['desc']
clean_df['title'] = cleaned['title']
clean_df['tags'] = cleaned['tags']
'''

clean_df['desc_url_cnt'] = new['desc']['url_cnt']
clean_df['desc_question_cnt'] = new['desc']['question_cnt']
clean_df['desc_exclamation_cnt'] = new['desc']['exclamation_cnt']
clean_df['desc_spaces_cnt'] = new['desc']['spaces_cnt']
clean_df['desc_numbers_cnt'] = new['desc']['numbers_cnt']
clean_df['desc_words_cnt'] = new['desc']['words_cnt']
clean_df['desc_uppercase_ratio'] = new['desc']['uppercase_ratio']
clean_df['desc_len'] = new['desc']['len']

clean_df['title_question_cnt'] = new['title']['question_cnt']
clean_df['title_exclamation_cnt'] = new['title']['exclamation_cnt']
clean_df['title_spaces_cnt'] = new['title']['spaces_cnt']
clean_df['title_numbers_cnt'] = new['title']['numbers_cnt']
clean_df['title_words_cnt'] = new['title']['words_cnt']
#clean_df['title_uppercase_cnt'] = new['title']['uppercase_cnt']
clean_df['title_uppercase_ratio'] = new['title']['uppercase_ratio']
clean_df['title_len'] = new['title']['len']

clean_df['tags_cnt'] = new['tags']['cnt']


print(clean_df.shape) # (40780, 18)
clean_df.head()
# USA
(40780, 17)
Out[5]:
category_id desc_url_cnt desc_question_cnt desc_exclamation_cnt desc_spaces_cnt desc_numbers_cnt desc_words_cnt desc_uppercase_ratio desc_len title_question_cnt title_exclamation_cnt title_spaces_cnt title_numbers_cnt title_words_cnt title_uppercase_ratio title_len tags_cnt
0 22 0 0 0 135 23 1072 0.175887 1410 0 0 135 23 1072 0.175887 1410 0
1 24 0 0 0 79 0 510 0.053968 630 0 0 79 0 510 0.053968 630 3
2 23 0 1 2 73 4 860 0.116398 1177 1 2 73 4 860 0.116398 1177 21
3 24 0 2 0 122 15 1096 0.066287 1403 2 0 122 15 1096 0.066287 1403 26
4 24 0 0 3 52 11 490 0.037736 636 0 3 52 11 490 0.037736 636 13
# JAPON
(19132, 17)
Out[5]:
category_id desc_url_cnt desc_question_cnt desc_exclamation_cnt desc_spaces_cnt desc_numbers_cnt desc_words_cnt desc_uppercase_ratio desc_len title_question_cnt title_exclamation_cnt title_spaces_cnt title_numbers_cnt title_words_cnt title_uppercase_ratio title_len tags_cnt
0 25.0 0 0 0 2 5 265 0.010309 291 0 0 2 5 265 0.010309 291 9
1 1.0 0 0 0 0 0 0 -1.000000 0 0 0 0 0 0 -1.000000 0 0
2 28.0 0 0 0 0 0 0 -1.000000 0 0 0 0 0 0 -1.000000 0 0
3 25.0 0 2 0 4 15 435 0.043796 548 2 0 4 15 435 0.043796 548 0
4 1.0 0 0 0 1 11 299 0.005666 353 0 0 1 11 299 0.005666 353 0
# MEXICO
(40780, 18)
Out[0]:
category_id comment_count desc_url_cnt desc_question_cnt desc_exclamation_cnt desc_spaces_cnt desc_numbers_cnt desc_words_cnt desc_uppercase_ratio desc_len title_question_cnt title_exclamation_cnt title_spaces_cnt title_numbers_cnt title_words_cnt title_uppercase_ratio title_len tags_cnt
0 22 15954 0 0 0 135 23 1072 0.175887 1410 0 0 135 23 1072 0.175887 1410 0
1 24 12703 0 0 0 79 0 510 0.053968 630 0 0 79 0 510 0.053968 630 3
2 23 8181 0 1 2 73 4 860 0.116398 1177 1 2 73 4 860 0.116398 1177 21
3 24 2146 0 2 0 122 15 1096 0.066287 1403 2 0 122 15 1096 0.066287 1403 26
4 24 17518 0 0 3 52 11 490 0.037736 636 0 3 52 11 490 0.037736 636 13

Clasificacion

Creacion de las categorias/clases

Estas categorias se reparten arbitrariamente, podrian separarse de mejor manera.

In [9]:
def clean_nans(df):
  res = df
  for c in df.columns:
    cnt = len(res[c])
    res = res[res[c].notnull()]
    if(cnt != len(res)):
      print(c,'=',cnt - len(res[c]), 'nulls')
  return res


target_columns = [
   'likes',
   'dislikes',
   'views',
]

target_df = dataset[target_columns]

#target_df['likes_ratio'] = (target_df['likes']+0)/(target_df['dislikes']+0)
likes_ratio = []
for l,d in zip(target_df['likes'], target_df['dislikes']):
  likes_ratio.append(get_ratio(l,d))
target_df = target_df.assign(likes_ratio = likes_ratio)


# bins = [-np.inf, 0.5, 1.5, 15, 50.0, 100.0, 200.0, np.inf]
# labels=['<0.5','<1.5','<15','<50','<100','<200','>200']
# target_df['likes_ratio_category'] = pd.cut(target_df['likes_ratio'], bins=bins, labels=labels)

# bins = [-np.inf, 500000, 1000000, 10000000, 50000000, np.inf]
# labels=['0.5M','1M','10M','50M','>50M']
# target_df['views_category'] = pd.cut(target_df['views'], bins=bins, labels=labels)

n_class = 6
labels = [i for i in range(n_class)]

target_df['likes_ratio_category'] =  pd.qcut(target_df['likes_ratio'], n_class, labels=labels)
target_df['views_category'], q =  pd.qcut(target_df['views'], n_class, labels=labels, retbins=True)

print(q)

target_df.head()

#X = clean_nans(clean_df)
#Y = clean_nans(target_df)
# USA
likes dislikes views likes_ratio likes_ratio_category views_category
0 57527 2966 748374 19.395482 2 3
1 97185 6146 2418783 15.812724 1 4
2 146033 5339 3191434 27.352126 2 5
3 10172 666 343168 15.273273 1 1
4 132235 1989 2095731 66.483157 4 4
# JAPON
likes dislikes views likes_ratio likes_ratio_category views_category
0 591.0 189.0 188085.0 3.126984 0 4
1 442.0 88.0 90929.0 5.022727 1 3
2 165892.0 2331.0 6408303.0 71.167739 5 5
3 1165.0 277.0 96255.0 4.205776 1 3
4 1336.0 74.0 108408.0 18.054054 3 3
# MEXICO
likes dislikes views likes_ratio likes_ratio_category views_category
0 4182.0 361.0 310130.0 11.584488 1 4
1 271.0 174.0 104972.0 1.557471 0 3
2 10105.0 266.0 136064.0 37.988722 3 4
3 378.0 171.0 96153.0 2.210526 0 3
4 57781.0 681.0 499965.0 84.847283 5 5

Features / Conteo

Comparación clasificadores

In [10]:
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.svm import SVC  # support vector machine classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB  # naive bayes
from sklearn.neighbors import KNeighborsClassifier


import numpy as np
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.metrics import f1_score, recall_score, precision_score


def run_classifier(clf, X_train, X_test, y_train, y_test, num_tests=100):
    metrics = {'f1-score': [], 'precision': [], 'recall': [], 'accuracy': []}
    for _ in range(num_tests):       
        clf.fit(X_train, y_train)    ## Entrenamos con X_train y clases y_train
        predictions = clf.predict(X_test)   ## Predecimos con nuevos datos (los de test X_test)
        
        metrics['y_pred'] = predictions
        metrics['y_prob'] = clf.predict_proba(X_test)[:,1]
        metrics['f1-score'].append(f1_score(y_test, predictions, average='weighted')) 
        metrics['recall'].append(recall_score(y_test, predictions, average='weighted'))
        metrics['precision'].append(precision_score(y_test, predictions, average='weighted'))
        metrics['accuracy'].append(accuracy_score(y_test, predictions,))
    return metrics

  
classifiers = [
  ("Base Dummy", DummyClassifier(strategy='stratified')),
  ("Decision Tree", DecisionTreeClassifier()),
  ("Gaussian Naive Bayes", GaussianNB()),
  ("KNN", KNeighborsClassifier(n_neighbors=5)),
]

target_classes = [
  ('Likes/Dislikes ratio Category', 'likes_ratio_category'),
  ('Views Category', 'views_category'),
]


X = clean_df 

for tname, col in target_classes:
  
  y = target_df[col]  
  results = {}
  for cname, clf in classifiers:

      scoring = ['precision_macro', 'recall_macro', 'accuracy', 'f1_macro']
      cv_results = cross_validate(clf, X, y, cv = 10, scoring = scoring, return_train_score= True)

      print("Cross validation, {}, {}, {}, {}, {}, ".format(
          cname,
          tname,
          np.mean(cv_results['test_precision_macro']),
          np.mean(cv_results['test_recall_macro']),
          np.mean(cv_results['test_f1_macro']),
          np.mean(cv_results['test_accuracy']),
        )
      )
      
      
# USA
Cross validation, Base Dummy, Likes/Dislikes ratio Category, 0.16582516136021672, 0.1688560888272835, 0.16699925864887608, 
Cross validation, Decision Tree, Likes/Dislikes ratio Category, 0.4921111639871506, 0.4887206171128245, 0.48722356979013004, 
Cross validation, Gaussian Naive Bayes, Likes/Dislikes ratio Category, 0.197754796171473, 0.19457272084091368, 0.14322336492392448, 
Cross validation, KNN, Likes/Dislikes ratio Category, 0.3976518923635172, 0.3963706719801323, 0.3954051860387052, 
Cross validation, Base Dummy, Views Category, 0.16537275391955172, 0.1638542117878079, 0.16721727795420482, 
Cross validation, Decision Tree, Views Category, 0.4204001292295986, 0.41785703456640383, 0.4138309381815904, 
Cross validation, Gaussian Naive Bayes, Views Category, 0.20549015297137344, 0.1860473158335499, 0.1228286269329943, 
Cross validation, KNN, Views Category, 0.35087324828551314, 0.3539199803632793, 0.34904974878395184, 
# JAPON
Cross validation, Base Dummy, Likes/Dislikes ratio Category, 0.17008731526041485, 0.1674708597852303, 0.16768430907964624, 
Cross validation, Decision Tree, Likes/Dislikes ratio Category, 0.3312799085061265, 0.3297696682587094, 0.3287783097071283, 
Cross validation, Gaussian Naive Bayes, Likes/Dislikes ratio Category, 0.2480342749334686, 0.18943040008904927, 0.1435481036130647, 
Cross validation, KNN, Likes/Dislikes ratio Category, 0.27894127060517043, 0.2713170138683517, 0.2703034374295706, 
Cross validation, Base Dummy, Views Category, 0.16674184008211068, 0.16563931441940555, 0.16926603908308593, 
Cross validation, Decision Tree, Views Category, 0.3553328962589004, 0.3477532317317613, 0.34795712653165045, 
Cross validation, Gaussian Naive Bayes, Views Category, 0.20812841774523952, 0.20849779512759348, 0.16952615565065834, 
Cross validation, KNN, Views Category, 0.2908427877178004, 0.2854995958281579, 0.2857989449148858, 
# MEXICO
Cross validation, Base Dummy, Likes/Dislikes ratio Category, 0.16894953319397465, 0.16623592181679214, 0.16674940155551873, 
Cross validation, Decision Tree, Likes/Dislikes ratio Category, 0.30266474676655564, 0.3030789453895765, 0.30207815971681595, 
Cross validation, Gaussian Naive Bayes, Likes/Dislikes ratio Category, 0.21528397066559038, 0.21418261611283845, 0.1685891429480186, 
Cross validation, KNN, Likes/Dislikes ratio Category, 0.27868555994823696, 0.27600472942419224, 0.2726866765505369, 
Cross validation, Base Dummy, Views Category, 0.1683177728223062, 0.1675923149612023, 0.1668813194327917, 
Cross validation, Decision Tree, Views Category, 0.3199910155616583, 0.31543216607001157, 0.31427533783944495, 
Cross validation, Gaussian Naive Bayes, Views Category, 0.1743566331497276, 0.20248357684130186, 0.12135909161781167, 
Cross validation, KNN, Views Category, 0.2787188839286823, 0.27670470721498364, 0.2733873170819457, 

Matriz de correlación

In [12]:
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as shc
import seaborn as sns

from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import normalize

columns = ['likes', 'dislikes', 'views', 'comment_count']
X = dataset[columns]
In [13]:
# Matriz de correlación

clean_df.columns == ['category_id', 'desc_url_cnt', 'desc_question_cnt',
       'desc_exclamation_cnt', 'desc_spaces_cnt', 'desc_numbers_cnt',
       'desc_words_cnt', 'desc_uppercase_ratio', 'desc_len',
       'title_question_cnt', 'title_exclamation_cnt', 'title_spaces_cnt',
       'title_numbers_cnt', 'title_words_cnt', 'title_uppercase_ratio',
       'title_len', 'tags_cnt']

target_df.columns == ['likes', 'dislikes', 'views', 'likes_ratio', 'likes_ratio_category',
       'views_category']

pd_total =pd.concat([clean_df, target_df], axis=1)
cor = pd_total.corr()
plt.figure(figsize=(15, 15))

sns.heatmap(cor, square=True, vmin=0)
# USA
<matplotlib.axes._subplots.AxesSubplot at 0x7f61904a68d0>
# JAPON
<matplotlib.axes._subplots.AxesSubplot at 0x7f8e70cb47f0>
# MEXICO
<matplotlib.axes._subplots.AxesSubplot at 0x7f6ea707ad68>

Regresión

In [14]:
from sklearn.linear_model import LinearRegression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor

from sklearn.model_selection import train_test_split
from sklearn import metrics

cols = [('Likes ratio', 'likes_ratio'), ('Views', 'views')]
regressors = [('Dummy', DummyRegressor()),
              ('Lineal', LinearRegression()),
              ('Random forest', RandomForestRegressor(n_estimators=10))]

X = clean_df 


for tname, col in cols:

  for rname, model in regressors:
    y = target_df[col]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=5)  

    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    print("{}, {}, {}, {}, {}, {},".format(
      rname,
      tname,
      metrics.mean_absolute_error(y_test, y_pred),
      metrics.mean_squared_error(y_test, y_pred),
      np.sqrt(metrics.mean_squared_error(y_test, y_pred)),
      model.score(X_test, y_test))
    )
# USA
Dummy, Likes ratio, 32.46890929248981, 2521.657280985768, 50.2161057927212, -0.0001475218907416309,
Lineal, Likes ratio, 31.94438839029975, 2473.1171752686023, 49.730445154538906, 0.01910460519702817,
Random forest, Likes ratio, 7.020593785191673, 330.41242452609237, 18.177250191546914, 0.8689508006962394,
Dummy, Views, 2740078.411490656, 54720139635989.31, 7397306.2418686785, -0.00018033150827045932,
Lineal, Views, 2668657.3026237767, 52896082472808.49, 7272969.302341959, 0.033159972633796,
Random forest, Views, 760829.380151852, 10237084343788.0, 3199544.3962833206, 0.8128855211123422,
# JAPON
Dummy, Likes ratio, 37.53899019876439, 5861.826918414592, 76.56256865084002, -0.000713046901789971,
Lineal, Likes ratio, 36.3127565456435, 5654.279656547122, 75.19494435497057, 0.03471878274624285,
Random forest, Likes ratio, 25.96163191396794, 4352.330022969688, 65.97219128518991, 0.2569836163661211,
Dummy, Views, 334291.06262363074, 2079242140333.8582, 1441957.745682535, -6.398550442443529e-05,
Lineal, Views, 313338.9837584121, 1813013840435.173, 1346482.0238069177, 0.12798523468307887,
Random forest, Views, 203209.92245898658, 667126831627.2294, 816778.3246556127, 0.6791285126767677,
# MEXICO
Dummy, Likes ratio, 37.230152833605246, 5998.787803297398, 77.45184183282795, -0.00015120881869390423,
Lineal, Likes ratio, 36.38410936112811, 5917.3604814079235, 76.92438157962613, 0.013424806384554568,
Random forest, Likes ratio, 30.95754685881187, 5627.323934961563, 75.01549129987461, 0.061781309738528134,
Dummy, Views, 458638.0188187429, 4008074060895.1963, 2002017.4976496075, -0.00046714540975156815,
Lineal, Views, 445687.3608611647, 3945092758548.248, 1986225.757195855, 0.015253802560747378,
Random forest, Views, 346822.0926027047, 2582277501097.7124, 1606946.6391569176, 0.3554301240626182,
In [15]:
import matplotlib.pyplot as plt

for tname, col in cols:
    y = target_df[col]
    order = [i for i in range(len(y))]
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=5)  

    model = RandomForestRegressor(n_estimators=10)
    model.fit(X_train, y_train)
    
    def getKey(i):
        return y.iloc[i]

    order.sort(key=getKey)
    
    _y = []
    _x = []
    for i in range(len(y)):
        _y.append(y.iloc[order[i]])
        _x.append(X.iloc[order[i]])
    
    plt.plot(model.predict(_x))
    plt.plot(_y)
    
    plt.show()
    
    plt.plot(model.predict(_x[:len(_x) - 2000]))
    plt.plot(_y[:len(_y) - 2000])
    
    plt.show()
# USA
# JAPON
# MEXICO